This .Rmd file pulls AirCasting data from the API, then cleans and tidies it.
AirCasting provides both fixed sessions and mobile sessions. Ideally, we would use the mobile sessions for analysis and the fixed sessions as “knots” to validate the measurements, but we are not aware of a model that can do that. For now, I will pull data from the mobile sessions, see what they look like, and remove observations that are obviously wrong.
I will follow Chris’s method:
Here are the usernames that I found on the AirCasting website:
NYCEJA, BLU 12, HabitatMap, BCCHE, scooby, Ana BCCHE, Ricardo Esparza, Tasha kk, lana, Marisa Sobel, Wadesworld18, El Puente 3, El Puente 4, El Puente 2, El Puente 1, mahdin, El Puente 5, Asemple, patqc, sjuma
```r
usernames <- c('NYCEJA', 'BLU%2012', 'HabitatMap', 'BCCHE', 'scooby', 'Ana%20BCCHE', 'Ricardo%20Esparza', 'Tasha%20kk', 'lana', 'Marisa%20Sobel', 'Wadesworld18', 'El%20Puente%201', 'El%20Puente%202', 'El%20Puente%203', 'El%20Puente%204', 'El%20Puente%205', 'mahdin', 'Asemple', 'patqc', 'sjuma')

user_test <- c('NYCEJA', 'HabitatMap', 'BCCHE', 'lana', 'Wadesworld18', 'patqc')
```
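The `%20` and `%2012` escapes above are hand-applied URL encoding for spaces in usernames. As a sketch, `utils::URLencode()` can generate them from the plain names, so the list only needs to be maintained in one readable place (the `raw_names` vector below is hypothetical, showing a few of the names):

```r
# Hypothetical helper: URL-encode plain usernames for use in the API query string
raw_names <- c('NYCEJA', 'BLU 12', 'Ana BCCHE', 'El Puente 1')
encoded_names <- vapply(raw_names, utils::URLencode, character(1), reserved = TRUE)
# e.g. 'BLU 12' becomes 'BLU%2012'
```

`reserved = TRUE` makes `URLencode()` escape spaces and other reserved characters rather than passing them through.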
Write a function that takes one username from the `usernames` vector, plugs it into the sessions API call, and extracts the session IDs:
```r
fetch_id <- function(name) {
  api_call <- str_c('http://aircasting.org/api/sessions.json?page=0&page_size=500&q[measurements]=true&q[time_from]=0&q[time_to]=2552648500&q[usernames]=', name)
  api_pull <- jsonlite::fromJSON(api_call)
  # keep only the AirBeam2-PM2.5 stream IDs that are present
  user_id <- api_pull$streams$`AirBeam2-PM2.5`$id %>% 
    .[!is.na(.)]
  user_id
}
```
```r
pulled_ids <- map(usernames, fetch_id) %>% 
  unlist()
```
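Individual API calls can fail (network hiccups, or users with no AirBeam2-PM2.5 stream). A sketch of a more robust pull, assuming `fetch_id` as defined above, wraps it with `purrr::possibly()` so one bad user does not abort the whole loop:

```r
# Sketch: return NULL instead of erroring when a single user's pull fails
fetch_id_safe <- purrr::possibly(fetch_id, otherwise = NULL)

pulled_ids <- map(usernames, fetch_id_safe) %>% 
  unlist()  # NULL results are dropped by unlist()
```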
```r
# This function plugs each ID into the measurement API call and pulls data for that ID
pull_fun <- function(id_element) {
  test_sess <- str_c("http://aircasting.org/api/realtime/stream_measurements.json/?end_date=2281550369000&start_date=0&stream_ids[]=", id_element) %>% 
    jsonlite::fromJSON() %>% 
    mutate(id = id_element) %>% 
    as_tibble()
  test_sess
}
```
```r
# map() returns a list; combine its elements into a single tibble
airbeam_data <- map(pulled_ids, pull_fun) %>% 
  do.call("bind_rows", .)
```
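The `map()` + `do.call("bind_rows", .)` pattern can also be written in one step with `purrr::map_dfr()`, which row-binds the results as it iterates (same output, assuming `pull_fun` as defined above):

```r
# Equivalent one-step version of the pull-and-combine above
airbeam_data <- map_dfr(pulled_ids, pull_fun)
```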
We did the following to clean the data:
```r
airbeam_data_tidy <- airbeam_data %>% 
  separate(time, into = c("year", "month", "day"), sep = "-") %>% 
  separate(day, into = c("day", "time"), sep = "T") %>% 
  separate(time, into = c("hour", "min", "sec"), sep = ":") %>% 
  separate(sec, into = c("sec", "remove"), sep = "Z") %>% 
  select(-remove) %>% 
  mutate(
    date = str_c(year, month, day, sep = '-'),
    date = as.Date(date)
  ) %>% 
  # keep plausible PM2.5 values
  filter(value > 0, value < 1000) %>% 
  # keep points in the New York area
  filter(latitude > 40, longitude > -75, longitude < -70)
```
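The `separate()` chain above splits the ISO 8601 timestamp by hand. A more compact alternative sketch uses lubridate, assuming the `time` column looks like `"2018-07-14T16:40:42Z"`:

```r
library(lubridate)

# Sketch: parse the timestamp once, then derive the pieces from it
airbeam_data_tidy <- airbeam_data %>% 
  mutate(
    time  = ymd_hms(time),   # parses "YYYY-MM-DDTHH:MM:SSZ"
    date  = as_date(time),
    year  = year(time),
    month = month(time),
    hour  = hour(time)
  ) %>% 
  filter(value > 0, value < 1000) %>% 
  filter(latitude > 40, longitude > -75, longitude < -70)
```

One caveat: `year()` and `month()` return numbers rather than the zero-padded character columns that `separate()` produces, so grouped output downstream would look slightly different (e.g. `1` instead of `"01"`).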
Subset 1: regular measurements
```r
airbeam_reg <- airbeam_data_tidy %>% 
  filter(value <= 250)
```
Subset 2: extremely high values
```r
airbeam_high <- airbeam_data_tidy %>% 
  filter(value > 250)
```
Our analysis will primarily focus on the regular measurements. However, the extremely high measurements are also useful, because they could help identify sources that cause peaks in PM2.5. These two subsets will be analyzed separately.
Here are the questions that we want to answer in the visualization:
Q1: Where are PM2.5 measured?
Here’s a plot that shows the spatial distribution of all measurement locations. Locations outside the New York area were already removed during cleaning.
```r
location_reg <- airbeam_reg %>% 
  group_by(latitude, longitude) %>% 
  summarize(avg_pm = mean(value))

leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(
    data = location_reg,
    lat = ~latitude, lng = ~longitude,
    color = 'green',
    radius = 3
  )
```
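The map above colors every point green regardless of concentration. As an optional sketch, leaflet’s `colorNumeric()` can shade the markers by `avg_pm` instead (the `"YlOrRd"` palette choice is an assumption):

```r
# Sketch: shade markers by average PM2.5 rather than a single color
pal <- colorNumeric(palette = "YlOrRd", domain = location_reg$avg_pm)

leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(
    data = location_reg,
    lat = ~latitude, lng = ~longitude,
    color = ~pal(avg_pm),
    radius = 3
  ) %>% 
  addLegend(pal = pal, values = location_reg$avg_pm, title = "Avg PM2.5")
```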
Q2: How do PM2.5 levels change over the course of a day?
```r
airbeam_reg %>% 
  mutate(hour = as.numeric(hour)) %>% 
  group_by(hour) %>% 
  summarize(pm_avg = mean(value)) %>% 
  ggplot(aes(x = hour, y = pm_avg)) + 
  geom_line()
```
Q3: Monthly or seasonal trends?
```r
monthly_averages <- airbeam_reg %>% 
  group_by(month, year) %>% 
  summarize(
    average = mean(value),
    pm_max = max(value),
    pm_min = min(value),
    observation_count = n()
  )

monthly_averages %>% 
  knitr::kable()
```
| month | year | average | pm_max | pm_min | observation_count |
|---|---|---|---|---|---|
| 01 | 2019 | 7.461890 | 148 | 1 | 66033 |
| 02 | 2019 | 8.387737 | 232 | 1 | 98456 |
| 03 | 2018 | 3.203922 | 6 | 1 | 510 |
| 03 | 2019 | 4.750883 | 31 | 1 | 1983 |
| 04 | 2019 | 5.328453 | 23 | 1 | 9015 |
| 05 | 2018 | 14.445833 | 21 | 10 | 480 |
| 07 | 2018 | 10.765362 | 215 | 1 | 160149 |
| 08 | 2018 | 16.740977 | 110 | 1 | 46243 |
| 09 | 2018 | 3.031858 | 33 | 1 | 2825 |
| 10 | 2018 | 3.807418 | 142 | 1 | 18335 |
| 11 | 2018 | 5.563186 | 234 | 1 | 19047 |
| 12 | 2018 | 8.987813 | 41 | 1 | 16083 |
In this table, average PM2.5 peaks in the summer months (July and August 2018). Note, though, that observation counts vary widely across months, so the monthly averages are not equally reliable.
Q4: Where do the extreme values appear?
```r
location_high <- airbeam_high %>% 
  group_by(latitude, longitude) %>% 
  summarize(avg_pm = mean(value))

leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(
    data = location_high,
    lat = ~latitude, lng = ~longitude,
    color = 'green',
    radius = 3
  )
```